Rec'd PCT/PTO 17 AUG 2001



U.S. DEPARTMENT OF COMMERCE PATENT AND TRADEMARK OFFICE 

TRANSMITTAL LETTER TO THE UNITED STATES 
DESIGNATED/ELECTED OFFICE (DO/EO/US) 
CONCERNING A FILING UNDER 35 U.S.C. 371 



ATTORNEY'S DOCKET NUMBER
60556-303420 



U.S. APPLICATION NO. (IF KNOWN, SEE 37 CFR

09/913921 



INTERNATIONAL APPLICATION NO. 
PCT/GB00/00492



INTERNATIONAL FILING DATE

16 February 2000 (16.02.00)



PRIORITY DATE CLAIMED 

19 February 1999 (19.02.99)



TITLE OF INVENTION 

Matching Engine 



APPLICANT(S) FOR DO/EO/US 

TURNER, Michael et al. 






Applicant herewith submits to the United States Designated/Elected Office (DO/EO/US) the following items and other information:

1. ☒ This is a FIRST submission of items concerning a filing under 35 U.S.C. 371.

2. □ This is a SECOND or SUBSEQUENT submission of items concerning a filing under 35 U.S.C. 371.

3. □ This is an express request to begin national examination procedures (35 U.S.C. 371(f)). The submission must include items (5), (6), (9) and (24) indicated below.

4. □ The US has been elected by the expiration of 19 months from the priority date (Article 31).

5. ☒ A copy of the International Application as filed (35 U.S.C. 371(c)(2))

   a. ☒ is attached hereto (required only if not communicated by the International Bureau).

   b. □ has been communicated by the International Bureau.

   c. □ is not required, as the application was filed in the United States Receiving Office (RO/US).

6. □ An English language translation of the International Application as filed (35 U.S.C. 371(c)(2)).

   a. □ is attached hereto.

   b. □ has been previously submitted under 35 U.S.C. 154(d)(4).

7. □ Amendments to the claims of the International Application under PCT Article 19 (35 U.S.C. 371(c)(3))

   a. □ are attached hereto (required only if not communicated by the International Bureau).

   b. □ have been communicated by the International Bureau.

   c. □ have not been made; however, the time limit for making such amendments has NOT expired.

   d. □ have not been made and will not be made.

8. □ An English language translation of the amendments to the claims under PCT Article 19 (35 U.S.C. 371(c)(3)).

9. ☒ An oath or declaration of the inventor(s) (35 U.S.C. 371(c)(4)).

10. □ An English language translation of the annexes of the International Preliminary Examination Report under PCT Article 36 (35 U.S.C. 371(c)(5)).

11. □ A copy of the International Preliminary Examination Report (PCT/IPEA/409).

12. □ A copy of the International Search Report (PCT/ISA/210).
Items 13 to 20 below concern document(s) or information included:

13. □ An Information Disclosure Statement under 37 CFR 1.97 and 1.98.

14. □ An assignment document for recording. A separate cover sheet in compliance with 37 CFR 3.28 and 3.31 is included.

15. □ A FIRST preliminary amendment.

16. □ A SECOND or SUBSEQUENT preliminary amendment.

17. □ A substitute specification.

18. □ A change of power of attorney and/or address letter.

19. □ A computer-readable form of the sequence listing in accordance with PCT Rule 13ter.2 and 35 U.S.C. 1.821 - 1.825.

20. □ A second copy of the published international application under 35 U.S.C. 154(d)(4).

21. □ A second copy of the English language translation of the international application under 35 U.S.C. 154(d)(4).

22. Certificate of Mailing by Express Mail

23. ☒ Other items or information:

    Postcard








24. The following fees are submitted:

BASIC NATIONAL FEE (37 CFR 1.492(a)(1)-(5)):

□ Neither international preliminary examination fee (37 CFR 1.482) nor
  international search fee (37 CFR 1.445(a)(2)) paid to USPTO
  and International Search Report not prepared by the EPO or JPO: $1000.00

☒ International preliminary examination fee (37 CFR 1.482) not paid to
  USPTO but International Search Report prepared by the EPO or JPO: $860.00

□ International preliminary examination fee (37 CFR 1.482) not paid to USPTO
  but international search fee (37 CFR 1.445(a)(2)) paid to USPTO: $710.00

□ International preliminary examination fee (37 CFR 1.482) paid to USPTO
  but all claims did not satisfy provisions of PCT Article 33(1)-(4): $690.00

□ International preliminary examination fee (37 CFR 1.482) paid to USPTO
  and all claims satisfied provisions of PCT Article 33(1)-(4): $100.00

ENTER APPROPRIATE BASIC FEE AMOUNT = $860.00

CALCULATIONS PTO USE ONLY

Surcharge of $130.00 for furnishing the oath or declaration later than
□ 20 □ 30 months from the earliest claimed priority date (37 CFR 1.492(e)): $0.00

CLAIMS                  NUMBER FILED    NUMBER EXTRA    RATE
Total claims                            - 20 =                      $0.00
Independent claims                                                  $0.00
Multiple Dependent Claims (check if applicable)  □                  $0.00

TOTAL OF ABOVE CALCULATIONS = $860.00

□ Applicant claims small entity status. (See 37 CFR 1.27). The fees indicated above are
  reduced by 1/2: $0.00

SUBTOTAL = $860.00

Processing fee of $130.00 for furnishing the English translation later than
□ 20 □ 30 months from the earliest claimed priority date (37 CFR 1.492(f)): $0.00

TOTAL NATIONAL FEE = $860.00

Fee for recording the enclosed assignment (37 CFR 1.21(h)). The assignment must be
accompanied by an appropriate cover sheet (37 CFR 3.28, 3.31) (check if applicable). □ $0.00

TOTAL FEES ENCLOSED = $860.00

Amount to be refunded:
charged:

a. □ A check in the amount of          to cover the above fees is enclosed.

b. ☒ Please charge my Deposit Account No. 02-3964 in the amount of $860.00 to cover the above fees.
     A duplicate copy of this sheet is enclosed.

c. □ The Commissioner is hereby authorized to charge any additional fees which may be required, or credit any overpayment
     to Deposit Account No. 02-3964. A duplicate copy of this sheet is enclosed.

d. □ Fees are to be charged to a credit card. WARNING: Information on this form may become public. Credit card
     information should not be included on this form. Provide credit card information and authorization on PTO-2038.

NOTE: Where an appropriate time limit under 37 CFR 1.494 or 1.495 has not been met, a petition to revive (37 CFR 
1.137(a) or (b)) must be filed and granted to restore the application to pending status. 

SEND ALL CORRESPONDENCE TO: 



Paul L. Hickman 

Oppenheimer Wolff & Donnelly LLP 
P. O. Box 52037 

Palo Alto, California 94303-0746 
United States of America 



SIGNATURE:

NAME: Paul L. Hickman

REGISTRATION NUMBER: 28,516

DATE: 17 August 2001 (17.08.01)






WO 00/49527







Matching Engine 



The present invention relates to a matching engine, and in
particular to an engine for identifying the best matches or
sets of matches between a query item and one or more items in
a set of data.



Currently, there are a multitude of matching techniques.
These current techniques may be split into two broad
categories: gradient-based methods and exhaustive search.
Examples of the former include gradient descent, simulated
annealing, relaxation labelling, neural networks and genetic
algorithms. All of these techniques take a few initial best
guess match solutions and refine them in order to obtain
better solutions.

The second category is exhaustive search techniques, in which
a large number of match solutions are examined by coarsely
sampling the solution space, and the best solution chosen. An
example of an exhaustive search technique is the fast access
method called geometric hashing.



There are problems associated with both of the above
categories of techniques. They are slow and give poor
performance on non-trivial matching problems. There are a
number of reasons for this poor performance. Gradient-based
methods depend critically on obtaining a good initial
solution; i.e. initial-guess match or transformation.
However, this is not always possible as obtaining a good
match is the final aim of the technique. Exhaustive search
methods are dependent on the resolution with which the
solution space is searched. For matching, the space is
exponential in the number of nodes, making it very unlikely
that a good solution can be found in a practicable time.

According to a first aspect of the invention there is
provided a method of identifying the best matches, or best
sets of matches, between a query item and one or more items
from a data set, comprising the steps of providing a data
representation of each item in the data set, providing a
query representation of the query item, providing a
parameterised transformation space, for each of a number of
overlapping regions of the transformation space spanning the
entire transformation space, determining an upper bound to
the probability of a match between the query representation
and the data representation under any transformation
contained in the region, determining a threshold probability,
comparing the upper probability bound of each region with the
threshold probability and determining regions of the
transformation space having an upper probability bound
greater than the threshold probability, so as to identify
solution regions.

The matching engine method of the invention provides a
process which leads to the discovery of better solutions to
matching problems; i.e. identifying objects with similar
features. The method includes the step of sketching an upper
boundary of the whole solution horizon, by obtaining an
upper bound probability for large, overlapping regions of the
space, thereby ensuring that the entire space is covered.
Given this coarse sketch it is possible to eliminate highly
implausible regions of the solution space and resketch the
new upper boundary, by computing a threshold and eliminating
regions of the space that fall below that threshold. The
sketch and eliminate process can be repeated so as to
naturally home in on the diverse good solutions to the
matching problem.
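
A single sketch-and-eliminate pass of the kind described above might be sketched in Python as follows. Everything specific here is an assumption made for illustration and is not taken from the specification: the one-dimensional transformation space, the toy score built from two Gaussian bumps, the function names, and the fixed threshold of 0.5. The upper bound for a region is taken as the sum of each bump's maximum over that region, which can never be smaller than the true maximum of the summed score.

```python
import math

# Hypothetical similarity curve: (amplitude, centre, width) of each Gaussian bump.
BUMPS = [(0.9, 2.0, 0.3), (0.6, 7.0, 0.5)]

def score(x):
    """Toy goodness of fit for the transformation x."""
    return sum(a * math.exp(-((x - c) / s) ** 2) for a, c, s in BUMPS)

def make_regions(lo, hi, n, overlap):
    """Split [lo, hi] into n regions that overlap their neighbours and span the space."""
    width = (hi - lo) / n
    return [(max(lo, lo + i * width - overlap), min(hi, lo + (i + 1) * width + overlap))
            for i in range(n)]

def upper_bound(region):
    """Upper bound on the score anywhere in the region (sum of per-bump maxima)."""
    r_lo, r_hi = region
    total = 0.0
    for a, c, s in BUMPS:
        nearest = min(max(c, r_lo), r_hi)  # point of the region closest to the bump centre
        total += a * math.exp(-((nearest - c) / s) ** 2)
    return total

def sketch_and_eliminate(regions, threshold):
    """Keep only regions whose upper bound exceeds the threshold."""
    return [r for r in regions if upper_bound(r) > threshold]

regions = make_regions(0.0, 10.0, 8, overlap=0.1)
survivors = sketch_and_eliminate(regions, threshold=0.5)
```

In this toy setting only the two regions covering the bump centres survive, and each surviving region can then be examined in more detail.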

Once the probability of a match between the query item and
the item from the data set has been determined by the
identification of a solution region, the item from the data
set can be identified as either being a plausible match or
not based on a further criterion. The remaining items from
the data set can then also be evaluated to identify either
the best matching data item or the set of best matching data
items from the entire data set.

Decisions about the solution horizon are no longer forced,
but emerge naturally as processing proceeds. The invention
provides a number of advantages compared to conventional
approaches. The method delays and softens decision making,
allowing many interpretations to be maintained early on in
processing, and to be passed on for subsequent processing.
Fewer cycles can be employed, dramatically reducing processing
resource requirements. The method can handle high
dimensional, complex data without difficulty because as the
number of dimensions increases it is a simple matter to
correspondingly increase the size of the sketched regions.
The method has a strong theoretical framework underpinned by
probability theory.

Moreover, the method not only provides better performance
within a module, it allows for step-change improvements
within systems as a whole. Conventionally, system processing
consists of passing best-guess solutions through a sequence
of modules; i.e. the best guess output from one module forms
an input to its neighbour. Since the best guess solution is
often not the best actual solution, errors propagate and
multiply, and cannot be subsequently rectified. According to
the invention, not just the best guess, but all plausible
solutions (i.e., those above a threshold) are passed between
modules without compromising computational resources. It is
only later on in processing when additional information has
been brought to bear that solutions are excluded. The result
is that good, diverse solutions naturally emerge from a
system utilising the method.

The method can include the further steps of sub-dividing the 
solution regions into further regions which span the solution 
regions, determining a new upper bound, determining a new 
threshold probability and determining new solution regions. 
Repetition of the sketching and elimination process in the 
solution regions of the solution space containing plausible 
solutions enables all the plausible solutions in the 
transformation space to be more accurately identified.

The method can include the step of iterating the further
method steps so as to identify the region of the
transformation space containing the best match between the
query and data set item. By repeated iteration the method
can result in identifying a region containing the best
solution or, depending on the termination criteria of the
method, a set of solution regions containing the best
solutions can be identified.

The method can be applied to a single item in the data set or
can be carried out for each of the individual items in the
data set, or for a selected subset of items from the data
set.

The method can terminate when all upper bounds of the
solution regions exceed the threshold probabilities. The
threshold can be heuristically increased to restart the
determination process on the remaining solution regions or
solution representations can be recorded and/or processed in
a conventional way. The method can include the step of
applying a gradient-based technique to determine a local
maximum. This is acceptable as a final stage as the solution
regions will only contain the plausible solutions.

The data representations can be topological representations
of the data items and the query representation can be a
topological representation of the query item. In using a
spatial or topological representation of the data items and
query item, the matching method is essentially one of pattern
recognition.

The topological representation of the data items and query
item can comprise a set of node measurement vectors, each
node measurement vector being associated with a node of a
topological arrangement of nodes defining the items. The
data items to be searched and the query item to be matched
with can have their properties defined by a set of
topologically or spatially arranged nodes. A set of node
measurement vectors for each item can then provide the
representation of that item which is used in the matching
method. The matching is then achieved essentially through
pattern recognition. The method is generally applicable to
matching patterns which can be held in computer memory.

The upper bound can be determined using Bayesian probability
theory.

According to a further aspect of the invention there is
provided a matching engine for identifying matches between a
query item and an item or items from a data set, the engine
comprising electronic data processing apparatus including a
memory storing a set of data representations of each item in
the data set, an input for inputting a query representation
of the query item and a processor which includes means for
defining a parameterised transformation space, means for
generating a number of overlapping regions of transformation
space spanning the entire transformation space, means for
determining for each region an upper bound to the probability
of a match between the query representation and a data
representation under any transformation in the region, means
for determining a threshold probability, a comparison means
which compares the upper probability bound for each region
with the threshold probability, means to identify solution
regions having an upper probability bound greater than the
threshold probability, and means to store an identification
derived from the solution region of the match between the
query item and data set item in a memory.

According to a further aspect of the invention there is 
provided a computer program which when running on a computer 
carries out a method according to the first aspect of the 
invention. According to a yet further aspect of the 
invention there is provided a computer program which when 
loaded into a computer provides a matching engine according 
to the second aspect of the invention. 

According to a further aspect of the invention there is
provided computer program code for identifying an item or
items from a data set, the code including instructions for
carrying out the functions of providing a data representation
of each item in the data set, providing a query
representation of a query item, defining a parameterised
transformation space, for each of a number of overlapping
regions of the transformation space spanning the entire
space, determining an upper bound to the probability of a
match between the query representation and a data
representation under any transformation in the region,
determining a threshold probability, comparing the upper
probability bound of each region with the threshold
probability so as to identify solution regions which do
contain solutions which match the database item to the query
item.

According to a further aspect of the invention there is
provided a computer readable medium storing computer program
code according to the above aspect of the invention. The
medium can be a permanent, semi-permanent, or temporary
storage or memory device, or can be an electrical signal
transmitted by wireline or wirelessly.

An embodiment of the invention will now be described in
detail, by way of example only, and with reference to the
accompanying drawings, in which:

Figures 1a, b, c and d show a series of solution space
diagrams illustrating steps of the method according to
the invention; and

Figure 2 shows a flow chart schematically illustrating a
software aspect of the invention.

As an example, the problem of automatically matching
molecules in order to maximise some similarity criterion will
be discussed. This is an important problem in the drug
development process. Chemists will have a 'query molecule'
of known behaviour and wish to use it to search a database
for similar molecules. This can be viewed as an optimisation
problem, i.e., finding the best alignments (matches,
transformations) between a query item and a database of items
(molecules) from a large number of possible molecules and
their alignments. The query item molecule and database
molecule items can be represented as patterns by placing
nodes at regular intervals on their surface, and a
measurement vector (containing characteristic properties of
the molecule, e.g. spatial and electrostatic information) can
be associated with each node. Thus, a pattern matching
problem results.

In this context the term node is considered to mean a
discrete labeled object with an associated measurement 
vector. Further, the term measurement vector is considered to 
mean a list of feature-value pairs, which may include, for 
example, the feature of spatial location and its value in 
some co-ordinate system. 
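
As a rough illustration of these two definitions, a node and its measurement vector could be held in memory as shown below. The class names, feature names and numerical values are hypothetical and chosen only for the example; the specification does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    """A discrete labelled object with an associated measurement vector."""
    label: str
    # Measurement vector as feature-value pairs (feature names are assumptions).
    measurements: Dict[str, object] = field(default_factory=dict)

@dataclass
class Pattern:
    """An item (query or database molecule) represented by its nodes."""
    name: str
    nodes: List[Node] = field(default_factory=list)

    def measurement_vectors(self):
        """Return the set of node measurement vectors that represents the item."""
        return [node.measurements for node in self.nodes]

# Hypothetical query molecule with two surface nodes.
query = Pattern("query molecule", [
    Node("n1", {"position": (0.0, 0.0, 1.2), "electrostatic_potential": -0.31}),
    Node("n2", {"position": (0.9, 0.1, 0.8), "electrostatic_potential": 0.12}),
])
```

Keeping the measurement vector as feature-value pairs mirrors the wording of the definition and lets further features be added per node without changing the structure.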

We now discuss in more detail the example problem,
considering for clarity only the problem of matching the
query item against a single database item at a time. It
should be noted that the invention lends itself to matching
the query item against multiple database items
simultaneously, as will be appreciated once we have discussed
the single item case.

Figure 1 shows a series of sketches of a solution surface for
this problem. The x-axis represents the possible alignments
of the query molecule with a molecule in the database and the
y-axis represents the similarity or goodness of fit for all the
different alignments. Each point on the curve represents the
goodness of fit of the query molecule to the database
molecule under a possible transformation (i.e. the curve may
be thought to sketch out the similarity between the
properties of the molecules as one is rotated or translated
relative to the other). The peaks and troughs represent good
and bad fits respectively between two molecular structures,
and the aim is to find the highest peaks.

As discussed previously, conventional techniques for
optimisation can be grouped into two general categories:
exhaustive search and gradient-based methods. Exhaustive
search techniques, for example geometric hashing and gnomonic
projection, try to identify peaks by jumping incrementally on
the solution surface. The number of good solutions that can
be identified relates directly to the step resolution. While
it is theoretically possible to find all the good solutions
by letting the step increment tend to zero, in practice this
results in a corresponding exponential increase in processing
resource requirement (typically processor speed and memory
requirements). There is an unfavourable trade off between
speed to a solution and quality of the result.
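
The trade-off can be made concrete with a small Python sketch. The score surface, the step sizes and the function names are assumptions chosen for illustration: a coarse step misses a narrow, high peak, while a step fine enough to find it multiplies the number of evaluations, and that count grows as a power of the number of dimensions of the transformation space.

```python
import math

def score(x):
    """Hypothetical surface: a broad, low peak at x=3 and a narrow, high peak at x=7.03."""
    broad = 0.5 * math.exp(-((x - 3.0) / 1.0) ** 2)
    narrow = 1.0 * math.exp(-((x - 7.03) / 0.02) ** 2)
    return broad + narrow

def grid_search(lo, hi, step):
    """Sample the surface at a fixed step and return (best x, best score, sample count)."""
    n = int(round((hi - lo) / step))
    samples = [lo + i * step for i in range(n + 1)]
    best = max(samples, key=score)
    return best, score(best), len(samples)

def evaluations(steps_per_axis, dimensions):
    """Number of grid points when every axis is sampled at the same resolution."""
    return steps_per_axis ** dimensions

coarse = grid_search(0.0, 10.0, 0.5)   # few samples, settles on the broad peak
fine = grid_search(0.0, 10.0, 0.01)    # many samples, finds the narrow peak
```

In one dimension the finer grid costs about fifty times more samples; with, say, six transformation parameters the same refinement would multiply the cost by that factor on every axis.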

Conventionally, gradient-based methods have been the only
alternative to exhaustive search techniques. They include
gradient descent, simulated annealing, neural networks, the
Expectation Maximisation (EM) algorithm and Genetic
Algorithms (GAs), as examples. At each incremental step a
routine is activated which ascends up to a local peak and
identifies its location. Having found one peak it may jump
through another increment and the process is repeated.
However, like the exhaustive search technique it is limited
in that the quality of solution is balanced against speed of
processing. In particular, the quality of the solutions found
depends upon where on the solution horizon the ascent is
started. A good solution can only be found if a reasonable
solution is known beforehand, which is not the case in
general. Processing usually begins at some random position
leading to a poor solution on termination.
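
The dependence on the starting point can be seen in a minimal gradient ascent sketch. The two-peak score surface, the learning rate and the start positions below are assumptions made for the example, and plain gradient ascent stands in for the whole family of gradient-based methods named above.

```python
import math

def score(x):
    """Hypothetical surface with a minor peak near x=2 and the main peak near x=7."""
    return (0.4 * math.exp(-((x - 2.0) / 0.8) ** 2)
            + 0.9 * math.exp(-((x - 7.0) / 0.8) ** 2))

def gradient_ascent(start, rate=0.1, steps=2000, h=1e-5):
    """Climb from `start` to the nearest local peak using a numerical gradient."""
    x = start
    for _ in range(steps):
        grad = (score(x + h) - score(x - h)) / (2 * h)  # central difference
        x += rate * grad
    return x

poor = gradient_ascent(1.0)   # starts near the minor peak and stays there
good = gradient_ascent(6.0)   # starts near the main peak and reaches it
```

Both runs terminate normally, yet only the second finds the better solution, and nothing in the first run signals that a higher peak exists elsewhere.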

Since all drug development technology is based on exhaustive
search or gradient-based methods, the discovery process is
time-consuming and expensive since poor performance means
that many cycles are necessary between experiments and
computational analysis to home in on a suitably active
compound.

The present invention delivers a step-change in technology to
speed up the drug development process. In particular, it
provides an engine for searching and comparing molecules held
in large 3D chemical databases. In practice, the engine has
been found to carry out an analysis over 1,500 times faster
than conventional commercially available packages operating
on the same hardware. This allows large databases to be
searched in seconds rather than days, and opens the way to
truly interactive computational drug design on the desktop.



Moreover, the invention gives better quality analyses, in
that it identifies a better set of molecules to test
experimentally. This in turn reduces the number of cycles
that are needed in the development process, leading to faster
and more cost-effective drug development.



The invention provides a new method of matching which is fast
and gives good performance. The method rests on a new
approach to pattern recognition based upon four key factors.
First, the matching problem is formulated as one of finding
the best set of transformations between the nodes in two
patterns. Second, calculations used in the method are
underpinned by Bayesian probability theory. Third, the method
is holistic in that it requires that all possible solutions
must be examined. Fourth, the data processing is
resource-driven such that the calculations that can be
performed are constrained by the memory available and the
speed of operations required, as defined by the operator.

The latter two considerations could lead to the conundrum of
how to look at an exponential number of solutions quickly and
efficiently. This is overcome by collecting solutions
together into a small number of (typically overlapping)
subsets or regions of the total set of possible solutions,
and assessing each region or subset in turn. There are a
number of estimates that may be made on a region, and an
effective strategy that is consistent with the processing
resource constraint allows a trade-off between speed and
accuracy by obtaining upper and lower bound scores
(probabilities) for any solution contained within a region or
subset.

Given these conditions, the optimal strategy to take is to
eliminate regions if their upper bound falls below the
highest lower bound. This guarantees that the optimal
solution will be retained. By repeating this operation it is
possible to home in on interesting regions of the solution
space by excluding sub-optimal solutions. The remaining
solutions may be re-examined in increasing detail as
processing proceeds and as the processing constraint
condition allows. The process terminates when all upper
bounds exceed the lower bound threshold. At this point the
lower bound may be heuristically increased to re-start the
elimination process, or alternatively the remaining
transformations may be recorded and processed in some
conventional way. Typically a gradient-based approach can be
employed since the regions that remain will contain the peaks
of interest. Once the match between the query molecule and
that molecule has been assessed other molecules in the
database can also be processed to assess their goodness of
match.
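
The elimination strategy can be sketched in Python as a simple loop. This is an illustration under stated assumptions, not the specification's implementation: the space is one-dimensional, the toy score is a sum of Gaussian bumps, regions are halved without overlap for brevity, the upper bound is the sum of per-bump maxima over a region, and the lower bound of a region is simply the score at its midpoint (any actual solution in a region is a valid lower bound on that region's best score).

```python
import math

# Hypothetical score: (amplitude, centre, width) of each Gaussian bump.
BUMPS = [(0.4, 2.0, 0.8), (0.9, 7.0, 0.8), (0.85, 9.0, 0.3)]

def score(x):
    return sum(a * math.exp(-((x - c) / s) ** 2) for a, c, s in BUMPS)

def upper_bound(lo, hi):
    """Sum of each bump's maximum over [lo, hi]; never below the true maximum."""
    total = 0.0
    for a, c, s in BUMPS:
        nearest = min(max(c, lo), hi)
        total += a * math.exp(-((nearest - c) / s) ** 2)
    return total

def lower_bound(lo, hi):
    """Score of one actual solution in the region (its midpoint)."""
    return score((lo + hi) / 2)

def refine(lo, hi, n_regions=8, rounds=6):
    width = (hi - lo) / n_regions
    regions = [(lo + i * width, lo + (i + 1) * width) for i in range(n_regions)]
    for _ in range(rounds):
        # Threshold is the highest lower bound found so far in this round.
        threshold = max(lower_bound(a, b) for a, b in regions)
        # Eliminate regions whose upper bound falls below the threshold.
        regions = [(a, b) for a, b in regions if upper_bound(a, b) >= threshold]
        # Re-examine the survivors in more detail by halving them.
        regions = [half for a, b in regions
                   for half in ((a, (a + b) / 2), ((a + b) / 2, b))]
    return regions

final_regions = refine(0.0, 10.0)
```

Because the region holding the best solution always has an upper bound at least as large as any lower bound, it is never removed, while the total width of the surviving regions shrinks from round to round.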






With reference to Figures 1a to 1d a brief schematic
illustration of the general features of the method will be
given before giving a more detailed description of the
method. In Figure 1a, the y-axis represents the goodness of
fit or the probability of a match. The x-axis represents the
set of all allowed transformations (e.g. rotations,
translations) between molecules. The query molecule for
which a match is to be identified is represented as a query
representation. The molecule from the database or data set
with which the query molecule is being compared is
represented as a data representation. The curve 100 is an
indication of the closeness of the match between the
representation of the query molecule with the representation
of the database molecule under different transformations.
The problem is to identify the peaks in the curve
representing plausible solutions without omitting any
plausible solutions in a practicable manner.

Firstly, the set of transformations is divided into a number
of regions A to H which span the entire transformation space.
For each of those regions an upper bound to the probability
of the match between the data representation and the query
representation under any transformation in the region is
calculated using Bayesian probability theory. The results of
such a calculation are shown as line 110. A threshold
probability is then calculated as shown by dashed line 120.
Those regions having their upper probability bound 110
falling below the threshold 120, in this case subsets A, C,
E, F and H, are then removed as there are clearly better
matches available within solution subsets B, D and G.
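
The first elimination step of Figure 1a can be mimicked with a few lines of Python. The numerical upper bounds and the threshold below are invented purely so that the outcome matches the regions named in the text; they are not values from the figure.

```python
# Hypothetical upper probability bounds (line 110) for regions A to H.
upper_bounds = {
    "A": 0.20, "B": 0.80, "C": 0.30, "D": 0.90,
    "E": 0.25, "F": 0.10, "G": 0.70, "H": 0.15,
}

# Hypothetical threshold probability (dashed line 120).
threshold = 0.50

# Regions whose upper bound falls below the threshold are removed.
kept = sorted(r for r, u in upper_bounds.items() if u >= threshold)
removed = sorted(r for r, u in upper_bounds.items() if u < threshold)
```

Only the kept regions are carried forward to the subdivision step.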

As illustrated in Figure 1b transformation regions B, D and G
are then subdivided into a number of further regions: B', B"
and B'", D', D", D'" and D"", and G'. A new upper bound on the
probability of matching with the query representation is
determined for each of the regions as illustrated by lines
122, 124 and 126. A new threshold probability is calculated,
as illustrated by line 128. Again, those regions falling
below the threshold value are removed from the solution space
such that only solution regions B', B" and D'" remain for
further processing. At this stage the process could be
terminated and the solutions containing identified matches
given by the molecule and its transformations falling within
solution regions B', B" and D'" could be saved, resulting in
a set of regions containing the best fit solutions. The
molecule can then be identified as one providing an
acceptable match dependent on some further matching criteria.

Alternatively, a further iteration of the process could be
carried out as illustrated in Figure 1c. Further upper
probability bounds 130 and 132 are calculated for further
subsets, among them B"", and compared with a newly derived
probability threshold to identify solution region B"". In a
final step a gradient method is utilised to find the local
maximum solution representation which has a corresponding
transformation identified as giving the best match to the
query molecule. The match with the remaining molecules in
the database can then be assessed individually.

It will be appreciated from the above discussion that the
invention lends itself to matching the query item against
multiple database items simultaneously. In this case the
solution surface is simply a concatenation of the solution
surfaces for each individual database item. The same
procedure as described is followed, with the addition that the
sketch and elimination process is applied across the whole of
the concatenated solution surface. Matching the query item
against multiple database items simultaneously can lead to a
more efficient method if it allows more efficient use to be
made of computer resources.
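
A concatenated solution surface can be sketched by tagging each region with the database item it belongs to and applying one threshold across all of them. The item names, the toy score curves, the region count and the threshold below are all assumptions made for this illustration.

```python
import math

# Hypothetical database: each item has its own toy similarity curve,
# given as (amplitude, centre, width) Gaussian bumps.
DATABASE = {
    "mol_1": [(0.3, 2.0, 0.5)],
    "mol_2": [(0.9, 5.0, 0.5), (0.4, 8.0, 0.5)],
    "mol_3": [(0.6, 1.0, 0.5)],
}

def upper_bound(bumps, lo, hi):
    """Sum of each bump's maximum over [lo, hi]; an upper bound on the item's score there."""
    total = 0.0
    for a, c, s in bumps:
        nearest = min(max(c, lo), hi)
        total += a * math.exp(-((nearest - c) / s) ** 2)
    return total

def concatenated_regions(database, lo, hi, n):
    """Regions of every item's surface placed side by side as (item, lo, hi)."""
    width = (hi - lo) / n
    return [(item, lo + i * width, lo + (i + 1) * width)
            for item in database for i in range(n)]

def eliminate(database, regions, threshold):
    """Apply one threshold across the whole concatenated surface."""
    return [(item, a, b) for item, a, b in regions
            if upper_bound(database[item], a, b) >= threshold]

regions = concatenated_regions(DATABASE, 0.0, 10.0, 10)
survivors = eliminate(DATABASE, regions, threshold=0.5)
surviving_items = sorted({item for item, _, _ in survivors})
```

An item whose every region falls below the threshold, such as the first one here, drops out of the search entirely in a single pass.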

Turning now to the use of a spatial arrangement of nodes to
represent the characteristic features of the molecules which
provide the pattern to be matched by the method. Consider a
pattern labelled by a set of N nodes. The nodes have an
associated set of measurement vectors, $x = \{x_1, \ldots, x_N\}$.

In order to match the pattern against a second, consider the
global set of transformations which maps the nodes in the
first pattern onto the second, denoted by
$w = \{w_1, \ldots, w_N\}$. From the first condition discussed above, the
aim is to find the best global solution, i.e., the best set
of transformations from the nodes in this pattern to a second
pattern, where, from the second and third conditions, a
holistic, probability theory approach is used which
requires:

$\hat{w} = \arg\max_{\omega \in \Omega} P(W = \omega \mid x)$    (1)

where $\Omega$ is the space of possible solutions for $w$. In other
words, all of the solution space is considered, making no a
priori assumptions about where or how often to search.

Note there is no aim to locate the best solution directly,
i.e., by actively searching for or refining solutions within
$\Omega$, this being the approach of existing gradient-based or
exhaustive search techniques. Rather, the method achieves
the same aim indirectly, by eliminating bad solutions from $\Omega$.
In doing so all of the solution space is implicitly examined,
as required by the third condition. This is achieved as
follows.
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Solutions are collected together since examining each, 
individual solution in isolation would be computarionally 
intractable in -general. This is done by considering all 
solutions that contain the indi^ridual trans lormat ion Wi=a, 
say, i.e./ all solutions where the transf onaation for node 1 
is fixed to be Wj=a (or, more precisely, in some small 
vicinity thereof) , but the transformations of all other nodes 
may vary. The lowest: upper bound for any one of these 
solutions (i.e., a region of the solution space) is such 
that: 

$$U(w_i = a) = \max_{w' \in W} P(w', w_i = a \mid x)$$
(2)

where $w'$ denotes the transformations on all nodes excluding
that under consideration, and $W$ is the space of all possible
transformations for this set.

Any region whose upper bound probability is below some known
lower bound value of interest, L, say, cannot contain the
optimum solution. Therefore, it is possible to eliminate
these regions from consideration. The rule at some
iteration time n is therefore:

eliminate the region containing the transformation $w_i = a$ if

$$U^{(n)}(w_i = a) < L^{(n)}$$
(3)

This is the key to the method: an upper bound on the
probability of a region of the solution space can be computed.
(At the onset the whole of the solution space can be covered,
generating an upper bound sketch as in Figure 1a.) Each
region or subset can then be compared against a lower bound
threshold. If the upper bound falls below the threshold the
region can be eliminated since it cannot contain a good
solution.

The computation of the upper bound has not yet been defined,
and in general may be computationally expensive. In order to
provide a computationally practicable method, a solution is to
identify quantities of the form $G^{(n)}(w_i = a)$ such that
$G^{(n)}(w_i = a) \geq U^{(n)}(w_i = a)$ which can be computed in a given time.
In other words, rather than compute the lowest upper bound,
U, some upper bound, G, is computed. Thus, computational
resources drive processing, which provides a computationally
tractable method that can be used to provide real time
results. The method makes optimal use of the allowed
computational resources when G is as close to U as possible.
The elimination rule then becomes:

eliminate the region containing the transformation $w_i = a$ if

$$G^{(n)}(w_i = a) < L^{(n)}$$
(4)

$G^{(n)}$ is evaluated by combining Bayesian probability theory
with rules of inequality. Its form may change over the
iterative cycles in order to accommodate the computational
resource requirement. For example, at the onset of
processing $G^{(n)}$ may be coarsely and quickly evaluated,
providing a coarse upper bound sketch (Figure 1a), but
provided it obeys $G^{(n)} \geq U^{(n)}$ then only bad solutions will be
eliminated.
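The safety of a coarse bound can be illustrated with a minimal Python sketch. Nothing in it is taken from the original text: the one-dimensional `SURFACE`, the fixed-width regions and the two bound functions are hypothetical stand-ins for the solution surface, U and G. The loose bound uses the summed probability of a region, chosen only because it is obviously never smaller than the region's maximum, not because it is cheaper to compute.

```python
from typing import Callable, List, Sequence, Tuple

Region = Tuple[int, int]  # half-open index range [start, stop)

# Hypothetical discretised solution surface (assumption, for illustration only).
SURFACE: Sequence[float] = [
    0.01, 0.02, 0.05, 0.03, 0.10, 0.30, 0.12, 0.04,
    0.02, 0.01, 0.06, 0.08, 0.05, 0.04, 0.04, 0.03,
]


def make_regions(n: int, width: int) -> List[Region]:
    """Cover indices 0..n-1 with contiguous regions of at most `width` items."""
    return [(start, min(start + width, n)) for start in range(0, n, width)]


def tight_bound(region: Region) -> float:
    """Stand-in for the lowest upper bound U: the true maximum in the region."""
    return max(SURFACE[region[0]:region[1]])


def loose_bound(region: Region) -> float:
    """Stand-in for a looser bound G >= U: the total mass of the region."""
    return sum(SURFACE[region[0]:region[1]])


def eliminate(regions: List[Region], bound: Callable[[Region], float],
              threshold: float) -> List[Region]:
    """Drop every region whose upper bound falls below the threshold."""
    return [r for r in regions if bound(r) >= threshold]


regions = make_regions(len(SURFACE), 4)
L = 0.12  # lower bound of interest, e.g. the value of an already known solution
kept_tight = eliminate(regions, tight_bound, L)
kept_loose = eliminate(regions, loose_bound, L)
print(kept_tight)
print(kept_loose)
```

In this toy case the looser bound retains more regions than the tight one, but the region holding the maximum survives under both, which is the property the text relies on.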

This frees up resources so that the surviving solution space
or solution subsets can be examined in more detail if
required. It also allows for lower upper bounds to be
computed at the next iteration, since there is less
interference in the system: the elimination of one
region affects the bound computed for overlapping regions at
the next time step.

Towards the end of processing, when only a few solutions
remain, a more sophisticated and computationally intensive
means of computing $G^{(n)}$ may be employed, such that $G^{(n)}$
approximates $U^{(n)}$, provided the fourth condition is not
violated.

Processing will continue until no solutions fall below the 
threshold. 
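The iterative character of the process can be sketched as follows. This is an illustrative toy under stated assumptions, not the patented implementation: the solution space is a hypothetical list of values, regions are index ranges that are halved at each iteration, and the bound used for each region is simply its exact maximum.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int]  # half-open index range [start, stop)


def sketch_and_eliminate(surface: Sequence[float], threshold: float,
                         start_width: int) -> List[Region]:
    """Alternate subdivision and elimination until single items remain."""
    regions: List[Region] = [(0, len(surface))]
    width = start_width
    while width >= 1:
        # Subdivide every surviving region into pieces of at most `width` items.
        pieces = [(s, min(s + width, stop))
                  for start, stop in regions
                  for s in range(start, stop, width)]
        # Bound each piece by its maximum and drop pieces below the threshold.
        regions = [p for p in pieces if max(surface[p[0]:p[1]]) >= threshold]
        width //= 2
    return regions


# Hypothetical solution surface (assumption, for illustration only).
surface = [0.01, 0.02, 0.05, 0.03, 0.10, 0.30, 0.12, 0.04,
           0.02, 0.01, 0.06, 0.08, 0.05, 0.04, 0.04, 0.03]
survivors = sketch_and_eliminate(surface, threshold=0.08, start_width=8)
stricter = sketch_and_eliminate(surface, threshold=0.2, start_width=8)
print(survivors)
print(stricter)
```

Re-running with a higher threshold, as in the second call, leaves fewer survivors; in this toy example only the single best item remains.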

At any time processing may be re-started, by heuristically
increasing the thresholds or, alternatively, the remaining
transformations may be recorded and processed in some manner.

In essence G is computed to sketch the solution surface,
which is compared against the threshold L to eliminate
uninteresting regions of the space. No other method is known
which uses such an holistic sketch and elimination
process.

The example of the method so far discussed is retrieval of
bio-active compounds from chemical databases by using one or more
query or lead compounds as a cue. The starting point is to
represent query and database compounds as patterns, each
identified by a set of spatially or topologically arranged
nodes, each node having an associated measurement vector.

Initially $U(w_i = a)$ is defined and then an inequality is
introduced to generate $G(w_i = a)$.

The upper bound probability in equation (2) can be developed.
By applying Bayes' rule equation (2) becomes

$$U(w_i = a) = \max_{w' \in W} \frac{p(x \mid w', w_i = a)\, P(w', w_i = a)}{p(x)}$$
(5)

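Making the non-restrictive assumption that the measurement
vectors $x = \{x_1, \ldots, x_N\}$ are independent when conditioned on the
transformations $w = \{w_1, \ldots, w_N\}$ then this becomes

$$U(w_i = a) = \max_{w' \in W} \Big[ p(x_i \mid w_i = a) \prod_{j \neq i} p(x_j \mid w_j) \Big] P(w', w_i = a) \,/\, p(x)$$
(6)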

An inequality is introduced to reduce computational
complexity. An option is

$$\max_{a \in A,\, b \in B} P(a, b) \leq \max_{a \in A} P(a)\, \max_{b \in B} P(b)$$
(7)

which gives

$$U^{(n)}(w_i = a) \leq p(x_i \mid w_i = a)\, P(w_i = a) \prod_{j \neq i} \max_{\beta \in W_j^{(n)}} p(x_j \mid w_j = \beta)\, P(w_j = \beta \mid w_i = a) \,/\, p(x) = G^{(n)}(w_i = a)$$
(8)

where $W_j^{(n)}$ is the set of possible transformations for node j,
and which reduces the complexity of the upper bound
calculation from exponential to $O(N^2)$. Alternative
inequalities could be applied here leading to increases or
decreases in complexity, as required.
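The trade between tightness and cost can be made concrete with a small Python sketch. The score used here is an assumption introduced for illustration: it multiplies per-node terms by pairwise coupling factors that never exceed 1, so dropping the couplings and maximising each node separately cannot decrease the value. It is one inequality of the kind the text allows and is not claimed to be the one in equation (7). The names `exact_bound` and `relaxed_bound` are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List

Unary = Dict[int, Dict[str, float]]


def exact_bound(unary: Unary, coupling: Callable[[str, str], float],
                cands: Dict[int, List[str]]) -> float:
    """Brute-force maximum over every joint assignment (exponential cost)."""
    nodes = sorted(cands)
    best = 0.0
    for assignment in product(*(cands[j] for j in nodes)):
        score = 1.0
        for j, b in zip(nodes, assignment):
            score *= unary[j][b]
        # Pairwise couplings between all nodes; each factor is at most 1.
        for p in range(len(assignment)):
            for q in range(p + 1, len(assignment)):
                score *= coupling(assignment[p], assignment[q])
        best = max(best, score)
    return best


def relaxed_bound(unary: Unary, cands: Dict[int, List[str]]) -> float:
    """Upper bound from per-node maxima (one pass over each candidate list)."""
    bound = 1.0
    for j, options in cands.items():
        bound *= max(unary[j][b] for b in options)
    return bound


# Hypothetical per-node terms for three nodes with two candidates each.
cands = {1: ["p", "q"], 2: ["p", "q"], 3: ["p", "q"]}
unary = {1: {"p": 0.9, "q": 0.2},
         2: {"p": 0.3, "q": 0.8},
         3: {"p": 0.7, "q": 0.6}}


def coupling(a: str, b: str) -> float:
    """Assumed agreement factor: 1 when two nodes agree, 0.5 otherwise."""
    return 1.0 if a == b else 0.5


u_exact = exact_bound(unary, coupling, cands)
g_relaxed = relaxed_bound(unary, cands)
print(u_exact, g_relaxed)
```

The relaxed value is looser than the exact one but needs only one maximisation per node instead of an enumeration of every joint assignment.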



Equivalent to equation (4) is:

eliminate the transformation $w_i = a$ from the list $W_i^{(n)}$ if

$$G^{(n)}(w_i = a) < L^{(n)}$$
(9)

where $G^{(n)}(w_i = a)$ is given in equation (8).

Taking logarithms, the elimination rule then becomes:

eliminate the transformation $w_i = a$ from the list $W_i^{(n)}$ if

$$S^{(n)}(w_i = a) < \log L^{(n)}$$
(10)

where $S^{(n)}(w_i = a)$ is given by:

$$S^{(n)}(w_i = a) = \log\big(p(x_i \mid w_i = a)\, P(w_i = a)\big) + \sum_{j \neq i} \max_{\beta \in W_j^{(n)}} \log\big(p(x_j \mid w_j = \beta)\, P(w_j = \beta \mid w_i = a)\big) - c$$
(11)

where $c = \log p(x)$ is a constant and the algorithm can be
applied to all candidate transformations at all nodes,
synchronously or asynchronously.
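A minimal sketch of a synchronous version of this log-domain rule is given below. It assumes, for illustration only, that transformations are integer labels and that the two log terms are supplied as hypothetical callables `log_unary` (for the first term of the support) and `log_pair` (for each term inside the sum); the constant c defaults to zero.

```python
import math
from typing import Callable, Dict, List

Candidates = Dict[int, List[int]]


def support(i: int, a: int, cands: Candidates,
            log_unary: Callable[[int, int], float],
            log_pair: Callable[[int, int, int, int], float],
            c: float = 0.0) -> float:
    """Log-domain support for candidate a at node i."""
    total = log_unary(i, a)
    for j, options in cands.items():
        if j != i:
            # Best surviving candidate at node j given w_i = a.
            total += max((log_pair(j, b, i, a) for b in options),
                         default=-math.inf)
    return total - c


def eliminate_until_stable(cands: Candidates,
                           log_unary: Callable[[int, int], float],
                           log_pair: Callable[[int, int, int, int], float],
                           log_threshold: float) -> Candidates:
    """Repeat synchronous elimination until no candidate is removed."""
    while True:
        # Every support in this pass uses the lists from the previous pass.
        updated = {i: [a for a in options
                       if support(i, a, cands, log_unary, log_pair)
                       >= log_threshold]
                   for i, options in cands.items()}
        if updated == cands:
            return updated
        cands = updated


# Toy model (assumption): agreeing labels cost nothing, disagreement costs 2.
start = {0: [0, 3], 1: [0, 5], 2: [0, 7]}
flat = lambda i, a: 0.0
agree = lambda j, b, i, a: 0.0 if a == b else -2.0
result = eliminate_until_stable(start, flat, agree, log_threshold=-1.0)
print(result)
```

An asynchronous variant would update each list in place so that later supports see earlier eliminations within the same pass.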

Application of the method requires models for the
distributions and priors in equation (11). For the
application of molecule matching one alternative is
rectilinear distributions with zero height away from their
centre. In this case the support for an individual
transformation is:

(12)

for n>0, where is a constant and where all solutions not
compatible with the data have been eliminated at the onset.
Here $h(w_i = a, w_j = \beta)$ is a binary compatibility measure, simply
stating if the transformation a on node i is compatible with
the solution $\beta$ on node j at time n. Thus $S^{(n)}(w_i = a)$
essentially counts the number of nodes that may be
consistent with the transformation under consideration at
node i.
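The counting interpretation can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that a transformation is a single number and that two transformations are compatible when they differ by no more than a tolerance; the function name `count_support` and the tolerance value are hypothetical.

```python
from typing import Callable, Dict, List


def count_support(i: int, a: float, cands: Dict[int, List[float]],
                  compatible: Callable[[float, float], bool]) -> int:
    """Number of other nodes holding at least one candidate compatible with a."""
    return sum(1 for j, options in cands.items()
               if j != i and any(compatible(a, b) for b in options))


def within_tolerance(a: float, b: float) -> bool:
    """Assumed binary compatibility: transformations differ by at most 0.25."""
    return abs(a - b) <= 0.25


# Surviving candidate transformations per node (toy values).
cands = {0: [1.0, 4.0], 1: [1.1, 9.0], 2: [0.9, 4.1]}
good = count_support(0, 1.0, cands, within_tolerance)
weak = count_support(0, 4.0, cands, within_tolerance)
print(good, weak)
```

A candidate supported by few nodes, such as the second one here, is the kind that a threshold on the count would remove.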

The procedure can combine the algorithm in (12) with
geometric hashing. It involves a storage stage in which
database compounds are encoded in a hash table, and a recall
stage in which a query compound is used to access the table,
and regions are examined. Finally, a clustering or searching
stage may be added to closely analyse remaining regions.

When the method is embodied as a computer program the
following functions are supported.

The following steps are taken in storage for each database
compound:

generate the database compound nodes, and their measurement
vectors to include node position and normal;

generate a frame for each point using the centroid-position-normal
triplet;

align this frame to the world frame and store the compound in
a hash table as compound-node-transformation triplets.
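The frame-generation step can be illustrated with a short Python sketch. The construction below is one plausible reading of "centroid-position-normal triplet" and is an assumption, not the text's definition: the frame origin is the node position, the z axis is the node normal, and the x axis is the direction to the centroid with its component along the normal removed. The helper names are hypothetical, and the hash-table layout is deliberately left out because the text does not specify it.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]
Frame = Tuple[Vec, Vec, Vec]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a: Vec) -> Vec:
    norm = math.sqrt(dot(a, a))
    if norm == 0.0:
        raise ValueError("degenerate direction: cannot build a frame")
    return (a[0] / norm, a[1] / norm, a[2] / norm)


def node_frame(centroid: Vec, position: Vec, normal: Vec) -> Frame:
    """Right-handed orthonormal axes (x, y, z) attached to one node."""
    z = unit(normal)
    to_centroid = sub(centroid, position)
    along = dot(to_centroid, z)
    # Keep only the part of the centroid direction perpendicular to the normal.
    x = unit((to_centroid[0] - along * z[0],
              to_centroid[1] - along * z[1],
              to_centroid[2] - along * z[2]))
    y = cross(z, x)
    return (x, y, z)


def to_local(point: Vec, position: Vec, frame: Frame) -> Vec:
    """Coordinates of `point` in the node frame, i.e. aligned to a world frame."""
    d = sub(point, position)
    return (dot(d, frame[0]), dot(d, frame[1]), dot(d, frame[2]))


frame = node_frame((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
local = to_local((1.0, 2.0, 0.5), (1.0, 0.0, 0.0), frame)
print(local)
```

Coordinates expressed in such a frame do not change when the whole compound is rotated or translated, which is the property that makes them usable as hash-table entries. The construction fails, by design, when the centroid lies exactly along the node normal.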

The following steps are taken in recall:

generate the query compound to define the object nodes, their
positions and normals;

generate a frame for each node using the centroid-position-normal
triplet;

align this frame to the world frame and access the hash
table, assigning accessed transformations to each node;

convert the transformation matrices to rotation parameters
and store in a hash table;

use the sketch and eliminate procedure in equations (12) and
(10) to eliminate implausible rotation solutions;

cluster the remaining solutions and obtain a similarity index
score for each by overlaying compounds.

Modifications to the description above for different 
applications occur at the level of modelling. This may either
be alterations to the form of the distributions assumed or to 
the measurement features employed. For example, in the 
molecule matching rectilinear distributions have been used 
but in this and other applications Gaussian distributions may 
be appropriate and, for example, curvature information may be 
employed. 

With reference to Figure 2 there is shown a schematic flow
diagram 200 of a software implementation of an aspect of the
invention. Initially a data molecule is selected from the
database at step 210. The data molecule is then transformed
into a data representation of that molecule 220 in the form
of a set of node measurement vectors as described above. A
representation of the query molecule is then generated 230,
again as a set of node measurement vectors. This step need
not be repeated in subsequent runs, and once generated the
query representation may be stored for further use as
required.

The match between the query and data representations is then
determined 240 by looking at the possible transformations
between the query and data representations so as to identify
possible solution regions in the transformation space. This
step may be iterated 245 so as to determine only the best
match or alternatively to determine a set of best matches, as
described above.

A match criterion can then be applied 250 to the best or set
of best matches so as to determine whether the query and data
item match sufficiently well. If the query and data item
match sufficiently well then an indication of the data item
and its goodness of match is stored 260 for future reference
or processing. The remaining items in the database can then
be compared with the query item 270 until all or a selected
amount of the database has been searched. The results, which
identify database compounds which sufficiently match the
query compound, can then be output 280. The results of all
the attempted matches can be stored and arranged in order of
goodness of match to identify a hierarchy of likely
compounds.
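The control flow of Figure 2 can be summarised in a short Python sketch. Only the loop structure follows the description; the representation and scoring functions are hypothetical placeholders supplied by the caller, and the example uses a simple set-overlap score purely as a stand-in for the matching step.

```python
from typing import Callable, Iterable, List, Set, Tuple, TypeVar

Item = TypeVar("Item")
Repr = TypeVar("Repr")


def search_database(query_repr: Repr,
                    database: Iterable[Tuple[str, Item]],
                    represent: Callable[[Item], Repr],
                    match_score: Callable[[Repr, Repr], float],
                    min_score: float) -> List[Tuple[str, float]]:
    """Loop over database items, keep sufficient matches, rank by goodness."""
    results: List[Tuple[str, float]] = []
    for name, item in database:                      # steps 210 and 270
        data_repr = represent(item)                  # step 220
        score = match_score(query_repr, data_repr)   # steps 240 and 245
        if score >= min_score:                       # step 250
            results.append((name, score))            # step 260
    # Arrange the stored matches in order of goodness of match (step 280).
    results.sort(key=lambda entry: entry[1], reverse=True)
    return results


def overlap(a: Set[int], b: Set[int]) -> float:
    """Stand-in score (assumption): shared features over all features."""
    return len(a & b) / len(a | b)


database = [("A", {1, 2, 3, 4}), ("B", {7, 8}), ("C", {1, 2, 3})]
ranked = search_database({1, 2, 3}, database, set, overlap, min_score=0.5)
print(ranked)
```

The query representation is built once outside the loop, reflecting the remark that step 230 need not be repeated in subsequent runs.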



Under different models and using different measurements there 
are a wide number of application areas for the matching 
engine of the invention. Each has at its core the problem of 
matching complex patterns. The matching engine can be used 
to identify features (items) in visual data sets, e.g. in
medical image analysis, visual inspection and control, 3D
reconstruction from video or film and 3D object monitoring in
video or film. In visual data applications, the full data
set of visual signals can be searched so as to identify
features in the video signals by matching the pattern of the
feature being searched for with the patterns present in the
video signals. As the method is holistic and covers the
entire data set, there is no loss of definition in the video
signals.

For instance the matching engine could be used to identify a
particular article, e.g. a mug, in a stream of video signals.
In this case, the mug would be the query item for which a
topological query representation would be generated. The
data item would then be a video frame still. The location of
the mug in the video still picture could then be identified
by the matching engine by searching through the video still
data item by considering all possible transformations of the
mug representation and then identifying the mug in the video
still. In this case the sequence of video still images would
be the database items which could be searched in turn by the
engine to identify the potential locations of the mug in the
video images. The application of the matching engine to
identify patterns in medical images (both video and
ultrasound) so as to locate body or tissue features will also
be appreciated from this example.

The matching engine can also find applications in the fields
of DNA and protein sequence matching as will be appreciated.
The matching engine can also be applied to the field of
time-series analysis, for example, speech recognition, by matching
patterns in current and old data sets and correlating those
matches with the known text.

It will be appreciated that the method is particularly suited
to implementation as a computer program, and that suitably
programmed electronic data processing apparatus will provide
a search engine capable of carrying out the pattern matching
method as described. The detailed requirements of a computer
program embodying the method described herein are considered
to be within the abilities of a man of ordinary skill in the
art of computer programming and so have not been described in
any detail.

1. A method of identifying the best matches or sets of
matches between a query item and an item or items from a
data set, comprising the steps of:

(i) providing a data representation for each item in the 
data set; 

(ii) providing a query representation of the query item; 

(iii) defining a transformation space; 

(iv) for each of a number of regions spanning the entire
transformation space, determining an upper bound to the
probability of a match between the query representation
and a data representation under any transformation in the
region;

(v) determining a threshold probability;

(vi) comparing the upper probability bound of each region
with the threshold probability; and

(vii) determining regions having an upper probability
bound greater than the threshold probability, so as to
identify solution regions.

2. A method as claimed in claim 1, and including the further
steps of:

sub-dividing the solution regions into further regions
which span the solution regions;

determining a new upper bound;

determining a new threshold probability; and

determining new solution regions.

3. A method as claimed in claim 2, including the step of
iterating the further method steps of claim 2 so as to
identify the solution region containing the best matching
solution or to identify a set of solution regions
containing a set of best matching solutions.

4. A method as claimed in claim 1, in which the data
representations are topological representations of the
data items and the query representation is a topological
representation of the query item.

5. A method as claimed in claim 4, in which the topological
representation of the data items and query item comprises
a set of node measurement vectors, each node measurement
vector being associated with a node of a topological
arrangement of nodes defining the items.

6. A method as claimed in claim 1, in which the upper bound 
is determined using Bayesian probability theory. 

7. A matching engine for identifying an item or items from a 
data set, the engine comprising electronic data 
processing apparatus including: 

a memory storing a data representation for each item in
the data set;

an input for inputting a query representation of the query
item; and

a processor which includes means for defining a
transformation space, means for generating a number of
regions of the transformation space spanning the entire
transformation space, means for determining for each
region an upper bound to the probability of a match
between the query representation and a data
representation under any transformation in the region,
means for determining a threshold probability, a
comparison means which compares the upper probability
bound for each region with the threshold probability,
means to identify solution regions having an upper
probability bound greater than the threshold probability,
and means to store an identification of a match between
the query item and the item of the data set in a memory.

8. A computer program which when running on a computer
carries out a method as claimed in claim 1.

9. Computer program code for identifying an item or items
from a data set, the code including instructions for carrying
out the functions of:

(i) providing a set of data representations of each item
in the data set;

(ii) providing a query representation of the query item;

(iii) defining a transformation space;

(iv) for each of a number of regions of the
transformation space spanning the transformation space,
determining an upper bound to the probability of a match
between the query representation and a data
representation under any transformation in the region;

(v) determining a threshold probability;

(vi) comparing the upper probability bound of each region
with the threshold probability; and

(vii) determining solution regions having an upper
probability bound greater than the threshold probability,
so as to identify the solution regions.

10. A computer readable medium storing computer code as
claimed in claim 9.